Web Document Classification based on Tagged-Region Progressive Analysis

Authors

Li-Chun Sung

Meng Chang Chen

Chin-Hwa Kuo

Hsiu-Chuan Hsu

Abstract

In this paper, we propose an intelligent web document classification method called TAgged-Region Progressive Analysis (TARPA). Instead of analyzing the whole content of a web page, TARPA parses the document into finer-grained, structured tagged regions and extracts only the few most important regions for analysis and classification. If these regions are not sufficient to classify the document, other important regions and linked pages are analyzed progressively to improve classification performance. TARPA has two stages: a learning stage and a classification stage. The learning stage determines the importance of tags or pairs of tags, and the classification stage analyzes the document by following that importance order. As a result, TARPA can classify a web document using less of its content while achieving a higher classification rate and a shorter processing time. Experiments show that 94% of the test web documents are correctly classified when the TARPA classifier is fed only 30% to 50% of the document contents.

Keywords: TARPA, web document classification, tagged-region, progressive analysis, information retrieval, Natural Language Processing (NLP).
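
The progressive procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag ranking `TAG_PRIORITY`, the keyword sets in `CATEGORY_KEYWORDS`, the `margin` stopping rule, and the function name `classify_progressively` are all hypothetical stand-ins. In TARPA the tag ranking would come from the learning stage; here it is hard-coded, only single tags (not pairs of tags) are ranked, and linked pages are not followed.

```python
from html.parser import HTMLParser

# Assumption: a fixed importance order standing in for the learned tag ranking.
TAG_PRIORITY = ["title", "h1", "h2", "b", "a", "p"]

# Assumption: toy keyword sets standing in for a trained classifier.
CATEGORY_KEYWORDS = {
    "sports": {"football", "match", "league", "score"},
    "finance": {"stock", "market", "bank", "shares"},
}


class RegionParser(HTMLParser):
    """Collects (tag, text) regions; text is attributed to the innermost
    open tag that appears in TAG_PRIORITY, or to "other" if none is open."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.regions = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_PRIORITY:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            tag = self.stack[-1] if self.stack else "other"
            self.regions.append((tag, text))


def classify_progressively(html, margin=2):
    """Feed regions in importance order and stop once one category leads
    by at least `margin` keyword hits. Returns (label, fraction_used);
    label is None when no category is strictly ahead."""
    parser = RegionParser()
    parser.feed(html)
    rank = {tag: i for i, tag in enumerate(TAG_PRIORITY)}
    # sorted() is stable, so document order is kept within one tag type.
    ordered = sorted(parser.regions, key=lambda r: rank.get(r[0], len(rank)))
    total_words = sum(len(text.split()) for _, text in ordered)

    scores = {category: 0 for category in CATEGORY_KEYWORDS}
    used_words = 0
    for _, text in ordered:
        words = text.lower().split()
        used_words += len(words)
        for category, keywords in CATEGORY_KEYWORDS.items():
            scores[category] += sum(word in keywords for word in words)
        best, second = sorted(scores.values(), reverse=True)[:2]
        if best - second >= margin:
            break  # confident enough; skip the remaining regions

    best, second = sorted(scores.values(), reverse=True)[:2]
    label = max(scores, key=scores.get) if best > second else None
    fraction_used = (used_words / total_words) if total_words else 0.0
    return label, fraction_used
```

Calling `classify_progressively(page_html)` on a page whose `<title>` already contains enough category keywords returns after reading only that region, which mirrors the idea of classifying from a fraction of the content. A real system would replace the keyword counts with a trained model and derive the tag order from labeled data.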

Similar resources

Document Analysis And Classification Based On Passing Window

In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

Developing a Recommendation Framework for Tourist by Mining Geo-tag Photos (Case Study Tehran District 6)

With the increasing popularity of sharing media on social networks and facilitating access to location technologies, such as Global Positioning System (GPS), people are more interested to share their own photos and videos. The world wide web users are no longer the sole consumer but they are producers of information also, hence a wealth of information are available on web 2.0 applications. The ...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

A Keyquery-Based Classification System for CORE

We apply keyquery-based taxonomy composition to compute a classification system for the CORE dataset, a shared crawl of about 850 000 scientific papers. Keyquery-based taxonomy composition can be understood as a two-phase hierarchical document clustering technique that utilizes search queries as cluster labels: In a first phase, the document collection is indexed by a reference search engine, a...

Publication date: 2002
